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Abstract 

Graphical Gaussian models are popular tools for the estimation of (undirected) gene as- 
sociation networks from microarray data. A key issue when the number of variables greatly 
exceeds the number of samples is the estimation of the matrix of partial correlations. Since the 
(Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this sce- 
nario, standard methods are inappropriate and adequate regularization techniques are needed. 
Popular approaches include biased estimates of the covariance matrix and high-dimensional re- 
gression schemes, such as the Lasso and Partial Least Squares. 

In this article, we investigate a general framework for combining regularized regression methods 
with the estimation of Graphical Gaussian models. This framework includes various existing 
methods as well as two new approaches based on ridge regression and adaptive lasso, respec- 
tively. These methods are extensively compared both qualitatively and quantitatively within a 
simulation study and through an application to six diverse real data sets. In addition, all pro- 
posed algorithms arc implemented in the R package "parcor" , available from the R repository 
CRAN. In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge 
Regression and Partial Least Squares, exhibit rather conservative behavior when combined with 
(local) false discovery rate multiple testing in order to decide whether or not an edge is present 
in the network. For networks with higher densities, the difference in performance of the methods 
decreases. For sparse networks, we confirm the Lasso's well known tendency towards selecting 
too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that pro- 
vides sparser solutions. In our simulations, both sparse and non-sparse methods are able to 
reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish 
the results obtained using the non-sparse methods and those obtained using the sparse meth- 
ods where specification of the regularization parameter automatically means model selection. 
Furthermore, for data that violate the assumption of uncorrelated observations (due to repli- 
cations), the Lasso and the adaptive Lasso yield very complex structures, indicating that they 
might not be suited under these conditions. The shrinkage approach is more stable than the 
regression based approaches when using subsampling. 



1 Introduction 



Besides Bayesian networks [13], auto-regressive models [52], and state-space models [26], graphical 
Gaussian models (GGMs) are a popular method for modeling genetic networks based on microarray 
transcriptome data. In the GGM methodology [48], which is considered in the present article, 
networks are represented as undirected graphs. Each vertex represents a gene, and an edge connects 
two genes if they are partially correlated. In contrast to correlation, which measures both direct and 
indirect interactions between pairs of variables, partial correlation measures the strength of direct 
interaction only. Since investigators are primarily interested in direct gene interactions, the GGM 
framework is attractive for modeling of regulatory networks: several recent methodological articles 
report successful applications of GGMs to the estimation of genetic networks from microarray data 
[9,12,20,25,36,37,54]. These approaches are used in numerous applied studies, e.g., for estimating 
Arabidopsis gene networks [21] or for the study of genetically mediated cortical networks [39]. 

Nonetheless, reconstructing GGMs from high-dimensional microarray data remains a difficult 
task. The standard estimation of partial correlations involves either the inversion of the sample 
covariance matrix, or the estimation of p least squares regression problems, where p is the number 
of genes. If the number n of observations (arrays) is much smaller than the number p of variables 
(genes), these approaches are inappropriate. Suitable alternatives are based either on regularized 
estimation of the (inverse) covariance matrix, or on regularized high-dimensional regression. The 
present paper focuses on the latter approach, and presents a comparative study on the use of 
various approaches to high- dimensional regression for covariance selection. The chosen methods 
are extensively compared in simulations and real data studies. Since for real data the ground truth 
(i.e. the true underlying network) is unknown, our performance analysis focuses on the similarities 
and differences between the investigated methods. In particular, we examine the connectivity and 
size of the resulting graphs, as the differences between the estimated networks. Moreover, we 
compare the stability of the methods with respect to subsampling and with respect to violations 
of i.i.d. assumptions. 

In the remainder of this section, we give a brief overview of graphical Gaussian modeling in 
the classical setting with n> p. Subsequently, we discuss the case of high-dimensional data in the 
"Methods" section. 

1.1 Gene Regulatory Networks and Graphical Gaussian Models 

Graphical Gaussian models (GGMs) [48] are fundamental tools in order to represent direct covariate 
interactions. Formally, a GGM is an undirected graph whose nodes represent variables, and whose 
edges represent conditional dependency relations. An edge between two nodes is missing if and only 
if they are conditionally independent given all other nodes. Assuming a joint normal distribution, 
the conditional dependence can be quantified in terms of partial correlations. For a random variable 
X and a finite set of random variables Z = {Zi, . . . , Z^}, the orthogonal complement of X with 
respect to Z is 

X\z = X-VzX, 

where the projection Vz is defined with respect to the inner product (Xi,X2) = i?[XiX2] between 
two random variables Xi and X2. Here, we tacitly assume that all involved moments exist. The par- 
tial correlation pz (Xi, X2) between Xi and X2 with respect to Z is the correlation of the orthogonal 
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complements of Xi and X2 with respect to Z: 

pz(X 1: X 2 ) = cor (X 1 \ Z ,X 2 \ Z ) . (1) 

In the context of gene regulatory networks, each of the p genes is represented by a random variable 
Xj (i = 1, . . . ,p). For each pair of genes we are interested in their partial correlation pij with 

respect to all other genes, i.e. with respect to the set of random variables Z\y = {Xi, . . . ,X P } \ 
{Xi,Xj}. 

Given n observations (arrays) Xi,...,x n € MP of the set of p genes, the standard unbiased 
plug-in estimate for the partial correlation coefficients pij in the case n > p can be formulated in 
two equivalent ways [48] , as outlined below. 

Notations 

In the rest of this article, 

X = (x±, . . . , x n ) T £ M nxp (2) 

denotes the nxp column-centered data matrix with rows corresponding to observations (arrays) 
and columns corresponding to variables (genes). The standard unbiased estimate of the p x p 
covariance matrix I] is then given as 

£ = — — XX. 

n — 1 

Formulation 1: Inversion of the Covariance Matrix 

If the estimate XI is invertible, an unbiased estimate of the partial correlation between genes i and 
j is obtained as 



Pij = —fik^. (3) 



with denoting the inverse of the estimated covariance matrix: 



-1 



ft = (Qij) 

Formulation 2: Least Squares Regression 

Let us consider the p linear regression models 

X, = ^ r : X, • for i = l,...,p, (4) 

where e stands for i.i.d. noise. Note that we do not include an intercept in the model because the 

variables are centered. For i = 1, . . . ,p, the least squares estimate "* = \ ■ ■ ■ , > > • • ■ > Pp^) 
of the vector of regression coefficients is the solution of the optimization problem 

s (0 2 



/3 W = arg min X^-X^/3 (5) 
= (X(V) T X(V)) _1 X^) T XW, (6) 
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where X® £ K" is the ith column of X and X^ G IR nx ( p ^ is the matrix obtained from X by 
deleting the ith column. The partial correlation between genes i and j is then estimated as 

p l3 = sign ^ . (7) 

In the n > p setting, the two regression coefficients /3® and /3p^ always have the same sign. 

Hence, W0j > (3? ' is well-defined. Mor eover, it can be shown that both formulations 1 and 2 are 
equivalent [48] in the sense that they always yield the same estimate. In the n > p setting, a test 
of the null hypothesis p^ = is available using results on the distribution of p^ . 

In microarray data, the number n of samples is typically very small as compared to the number 
p of considered genes. Hence, the above framework is inappropriate for two reasons. Firstly, the 
standard estimate of the partial correlation matrix given by Eqs. (|3j) and ([7]) is not appropriate 
when n < p: in formulation 1, the estimated covariance matrix S is typically ill-conditioned 
or even singular, and its generalized (Moore- Penrose) inverse has large mean squared error [36]. 
In formulation 2, the least squares criterion §5^ is ill-posed and leads to overfitting. Hence, an 
alternative regularized estimate of the partial correlation matrix has to be used in the context of 
GGMs with high-dimensional data. The two formulations 1 and 2 lead to two different strategies 
for the regularized estimation of the partial correlations in the p ^> n setting, which are reviewed 
in the Methods section. 

Secondly, the testing approach mentioned above breaks down in the p n setting, since 
the sampling distribution of estimates under the null hypothesis of zero partial correlation 
is unknown. Two alternatives have been proposed in order to assess statistical significance: (i) 
methods based on sparse estimates of the partial correlation matrix that do not require separate 
testing, and (ii) methods based on empirical null modeling and (local) false discovery rate multiple 
testing [10,37,41]. 



2 Methods 

This section reviews the available strategies for estimating GGMs in the p ^> n setting: biased 
large-scale covariance estimation and regularized regression including our two novel variants (Ridge 
Regression and Adaptive Lasso). 

2.1 Regularized Estimation of the (Inverse) Covariance Matrix 

This approach is derived from formulation 1 . The general approach is to plug a regularized estimate 
of the inverse of the sample covariance matrix into Eq. ([3]). Schafer & Strimmer [36] adopt this 
approach and propose a shrinkage estimator of the covariance matrix. This shrinkage estimator 
is constructed as a convex combination of the unrestricted sample covariance matrix I] and an 
estimator T of a specified low-dimensional sub-model T: 

£ A = AT +(1-A)E, 

where the factor A £ [0, 1] controls the shrinkage intensity. Let us assume a parametrization of 
covariances in terms of correlations and variances, whereas shrinkage is applied to the correlations 
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and diagonal entries are left intact, i.e. the estimator does not shrink the variances. For correla- 
tion shrinkage, we consider the identity matrix as the most commonly employed shrinkage target. 
Notice that the optimal shrinkage intensity A can be determined analytically and be estimated 
from the data. Thus, the resulting correlation shrinkage estimator is positive definite, and favor- 
able properties carry over to derived quantities, such as sample partial correlations. Subsequently, 
model selection of the gene association network can be achieved using empirical null modeling and 
(local) false discovery rate multiple testing [10,37,41]. 

Estimates of the inverse covariance matrix can also be obtained using bootstrap aggregating 
(bagging) as a technique for variance reduction [6] . In some implicit way, the bootstrap procedure 
presumably helps to regularize the problem. However, bagging schemes are inferior to the shrinkage 
estimator [36] , and computationally much more expensive. A recent extension using the augmented 
bootstrap [46] is in fact closely related to the shrinkage estimator [34,42] and is expected to perform 
similarly. 

In this paper, we use the correlation shrinkage based approach as a reference method in com- 
parison with the regression based approaches to covariance selection. 

Finally, recent novel approaches are to be noted that are based on l\ regularized maximum 
likelihood estimation in graphical Gaussian models [8, 12,30,49,54]. Corresponding inverse covari- 
ance estimates exploit the sparsity in the graphical structure and conduct parameter estimation 
and model selection simultaneously. However, despite recent advances in semidefinite programming 
computation remains challenging in practice due to the high-dimensionality and positive definite- 
ness constraint [53]. 



2.2 Regularized Regression 

Here, the strategy is to replace the least squares estimator in Q by some regularized estimator 
of the regression coefficients that can be used in formula ([7]) to obtain estimators of the partial 
correlations. More formally, we define the following class of estimates of the partial correlations. 

Definition 1. For any regression method reg that yields (regularized) estimates /3 rcg of the linear 
regression model Q), we define the corresponding estimate of the partial correlations as 

Pij,ieg = sign (/^Ig) min |l, ^ty%J%jl g \ 



if sign (pj% g ) = sign (/^ r } eg ) 

and otherwise. 

This definition ensures that the estimated partial correlation coefficients are always well-defined 
and that they lie in the interval [—1,1]. Again, we can roughly distinguish between regression 
methods that require testing to construct the undirected graphs, and sparse regression methods. 

In the rest of this subsection, we discuss two regularized regression methods (PLS and the Lasso) 
that have been proposed for the estimation of large-scale GGMs in the literature. Furthermore, we 
propose two additional attractive methods (ridge regression and the adaptive Lasso). 
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Partial Least Squares 

Tenenhaus et. al. [44] suggest Partial Least Squares (PLS) regression [50,51] as a plug-in for Def. 
[T] PLS is a method for supervised dimensionality reduction. It has its seed in the chemometrics 
community, but its success has lead to applications in various other scientific fields, e.g. in chemo- 
and bioinformatics [5,33]. The main idea of PLS is to build a few orthogonal components from 
the original data X^ and to use them as predictors in a least squares fit. A PLS component 
t = X^'w is a linear combination of the original predictors that have maximal covariance with 
the response vector X®, under the additional assumption that the components are mutually 
orthogonal. Formally, the k-th PLS component is defined by 

Wk = arg max cov (x^w, X^^ 

\\w\\=l \ ) 

s.t. w T X^ l)T X^ i) wi =0 for I < k . 

Hence, PLS regularizes the regression problem by compressing the p variables into a small 
number m of orthogonal components T = (t 1; . . . ,t m ) and regressing the response variable onto 
these components. After rescaling the weight vectors {k = 1, . . . ,m) such that t& has length 1, 
this leads to the regression coefficients 

3^ = («!,..., w m )T T X^. 

While the original formulation of PLS scales with the number p of variables, it is also possible 
to represent the algorithm in a way that it only scales with the number n of observations [28,29]. 
This leads to a dramatic decrease in computation time for p 3> n. Note that the number of PLS 
components is a model parameter that has to be optimized for each of the p regression models 
Q. The standard model selection techniques are cross-validation or information criteria based on 
degrees of freedom [18]. In the context of gene regulatory networks, Tenenhaus et.al. [44] propose 
to use the same number of components m for all p regression models. They observe empirically 
that the partial correlation coefficients (Def. [TJ obtained from PLS regression reach a plateau 
when the number of PLS components m increases, and suggest a heuristic procedure to choose the 
smallest m for which the plateau is reached. However, in our experiments, we use the theoretically 
well- funded and popular cross-validation technique with k folds. 

As the PLS coefficients are not sparse, the obtained partial correlations are in general non- 
zero. Thus, a statistical testing procedure has to be used to determine which edges are significant. 
(Alternatively, one might also use a sparsification of PLS as proposed by Chun & Keles [7].) In the 
present article, we use large-scale simultaneous hypothesis testing with local false discovery rate 
(fdr) level 0.2, in order to identify unusual outliers among the estimated partial correlations. 

For the sake of completeness, let us mention in this section a variant of the PLS approach 
described above, which was recently suggested by Pihur et al. [25]. Instead of estimating the 
partial correlation using Eq. ([7| , they propose an alternative measure of correlation strength which 
is very similar to the PLS-based partial correlation coefficient except that, roughly speaking, the 
square root of the product of Pjpi s an d /3^ s is replaced by their sum. We remark that Pihur et. al. 
do not optimize the number of PLS components m and recommend to use m ~ 3. 
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Ridge Regression 

Ridge regression (see e.g. [15]) is probably the most popular and most straightforward regularized 
regression technique. Regularization is performed by adding a penalty term P((3) to the least 
squares criterion (|5| . Ridge regression is based on an £2 penalty term of the form 

P(/3) = A||/3||2 = A]r& 2 , (8) 

i 

where A > denotes the penalty parameter. This leads to a reduction of variance and thus 
avoids overfitting. 

The solution obtained by ridge regression depends on the penalty parameter A. In our paper, 
we use standard £;-fold cross-validation to select the optimal amount of penalization A. As ridge 
regression does not lead to sparse solutions, we use large-scale false discovery rate multiple testing 
[41] to test for significant edges, as described above in the subsection on PLS. Again, we adopt a 
level of 0.2. 

The Lasso 

Meinshausen and Buhlmann [22] propose to estimate the regression coefficients in Def. [T] with 
the Lasso [45] and study under which conditions model selection consistency applies, hinging on 
the choice of the penalty. Similarly to Ridge Regression, the estimated regression coefficients are 
chosen to minimize a penalized least squares criterion. Lasso regression is based on a £i-penalty of 
the form 

P((3) = \\\(3\\ 1 = \Y / \Pi\, (9) 

i 

where A > is the regularization parameter. With the -^-penalty, many estimated regression 
coefficients will be equal to 0. As a result, with variable selection in mind, the Lasso has a major 
advantage: a sparse estimator of the matrix of partial correlations is yielded and a graph can 
be obtained by assigning an edge between two genes if and only if p^iasso 7^ 0. The choice of 
the penalty A has to be determined for each of the p high-dimensional regressions successively. 
Again, this can be done using some cross-validation scheme or information criteria. Meinshausen 
& Buhlmann [22] motivate a choice of the penalty parameter that aims at controlling the probability 
of falsely connecting two nodes in the graph, i.e. that is a choice tailored to the graph structure. 
However, experiments [36] indicate that this approach leads to graphs that are too dense, i.e. too 
many edges are selected. Therefore, in this paper, we use the oracle penalty for optimal prediction 
that is determined using /c-fold cross-validation. 

The two-stage adaptive Lasso 

The Lasso is only asymptotically consistent for covariance selection when requiring certain necessary 
conditions among the variables in the GGM. Zhou et al. [55] show that the two-stage adaptive Lasso 
procedure [56] is consistent for high-dimensional model selection in graphical Gaussian models 
under rather general and less restrictive conditions. The adaptive Lasso [56] considers the Lasso 
with penalty weights as 
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P(/3) = A^|A 

i 



(10) 



where the weights Wi are chosen in a data-dependent manner. Specifically, the adaptive Lasso 
is defined as follows. Suppose (3 is a ^/n consistent initial estimator of (3. For example, we can use 
the least squares estimator (3 ois . Pick a 7 > 0, and define the weights Wi = l/|/9i i0 is| 7 - The most 
common choice is 7 = 1. Here, we use the Lasso estimator /3i asso as initial estimator, and define 
the weights 

Wi = l/|A,lasso|- 

Note that the amount of penalization in both the initial stage Lasso and the second stage 
Lasso with penalty weights is determined via fo-fold cross-validation. The adaptive Lasso will be 
at least as sparse as the Lasso. For graphical Gaussian modeling, the adaptive Lasso estimates 
are used in Def. [T] and two genes are connected if and only if the partial correlation coefficient 
Pij,adaptive lasso 0. We remark that for model selection, the optimal weights have to be determined 
in each of the k cross-validation splits. As the optimal weights themselves are determined via k- 
fold cross-validation, this implies that a lasso fit has to be computed k 2 times! This leads to high 
computational costs. 

3 Results 

In this section, we perform extensive experiments to compare regression-based methods for recon- 
structing gene regulatory networks. We consider the recently proposed techniques PLS regression 
and Lasso regression, and the two additional methods, ridge regression and adaptive Lasso regres- 
sion, that have not been applied in practice for this purpose before. As a reference method, we 
use the shrinkage approach to covariance estimation, followed by matrix inversion. An overview 
of the five considered methods and their respective parameters and characteristic features is given 
in Table 1. All methods are implemented in the R package "parcor" [19], available from the R 
repository CRAN. 

3.1 Simulations 

The performance of the proposed methods is assessed in a simulation study with a set-up similar 
to [36]. The number of variables is fixed at p = 100, and various sample sizes ranging from 25 to 
200 in steps of 25 are investigated. We consider two different scenarios. First, we simulate networks 
with varying degree of density and no network topology, and second, we investigate sparse networks 
with different network topologies. These scenarios correspond to particular choices of the partial 
correlation matrix P (see below). For all experiments, a total of 20 replications are performed for 
each sample size to average out variability due to random sampling. For each replication, the data 
are drawn randomly from a multivariate normal distribution with correlation structure derived 
from P. 
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Varying degree of density 

Partial correlation matrices P of size p x p with a proportion of 

d = 5%; 10%; 15%; 20%; 25% 

non-zero entries are constructed by first drawing the non-zero entries from a uniform distribution 
on [—1, 1] and then rescaling the non-diagonal entries to ensure that we obtain a feasible partial 
correlation matrix (for more details, see the R-package GeneNet [34]). Hence, the range of the non- 
zero partial correlations depend on the density of the network. If the network is rather dense, the 
absolute values of the non-zero partial correlation coefficients are very small compared to a sparse 
network. This is illustrated in the supplementary file histpcor.pdf. Here, we plot the histogram 
of the non-zero partial correlations for a random matrix P of density d. It is important to note 
that due to the small values, the reconstruction of the network becomes more delicate for a higher 
degree of density: it is more difficult to select the correct non-zero entries if their true vales are 
close to zero. We remark that this effect cannot be entirely eliminated by a more clever simulation 
design, and that the simulation of partial correlation matrices is far from trivial [31]. 

For each generated data set, P is then estimated based on PLS regression, ridge regression, the 
Lasso, the adaptive Lasso and the shrinkage covariance estimator, successively. For all regression- 
based methods, k = 5-fold cross-validation is used to optimize the model parameters, i.e. the 
number of components m for PLS and the penalty A for ridge regression, the Lasso and the two-stage 
adaptive Lasso, respectively. For the Lasso and the adaptive Lasso, we follow the parametrization 
implemented in the lars package [14], based on the ratio of the £i-norm of the Lasso and the 
£i-norm of the least squares estimates. Specifically, the regularization parameter is chosen from an 
equidistant sequence between and 1 of length 1000. Furthermore, we normalize this parameter 
to avoid the peaking phenomenon at n = p (see [17] for details). For ridge regression, we consider 
a logarithmically spaced sequence Zi, . . . , Ziooo ranging from 10~ 10 to 10 _1 . The candidate penalty 
parameters are then defined as \ s = l s np (with s = 1, . . . , 1000). Finally, the range of the number 
of PLS components is from 1 to 15. 

We evaluate the accuracy of the resulting estimators in two respects: (i) the estimation error 
of the partial correlation matrix itself, and (ii) the recovery of the underlying networked topology. 
The difference between the estimated and true matrix of partial correlations is measured in terms 
of the mean squared error (MSE). In the upper left panel of Figures la-e, the MSE is displayed as 
a function of the sample size n. 

For sparse networks, the two sparse estimates based on the Lasso and the adaptive Lasso, 
respectively, yield a lower MSE compared to the three other methods that are not sparse and 
are likely to contain many non-zero but non-significant (small) entries, which ultimately lead to a 
higher MSE. This effect vanishes for higher degrees of density. A notable exception is PLS. For 
denser networks, the MSE becomes larger. These networks correspond to small absolute values of 
the entries in P. Therefore, we conjecture that PLS is not able to shrink the regression coefficients 
enough, as the regularization parameter m (number of components) is discrete. This is in contrast 
to the four other methods. Note however that for the reconstruction of the underlying networked 
topology the MSE is only of secondary interest. 

For each investigated sample size, the resulting number of selected edges is displayed in the 
upper right panel of Figures la-e, while the horizontal line is the number of true edges. For sparse 
networks, the Lasso with its regularization parameter chosen to be prediction optimal tends to 
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select too many edges. PLS, ridge regression and the approach based on shrinkage covariance 
estimation are in contrast far more conservative and rather select too few edges, even in the n > p 
case. The adaptive Lasso is less conservative and appears to be a promising alternative. Again, 
these differences vanish for higher degrees of densities. As remarked above, the reconstruction task 
becomes more difficult for higher degrees of density. This explains the low number of selected edges 
for higher degrees of density. 

The two lower panels in Figures la-e correspond to the power (left) and the true discovery rate 
(tdr, right) which are defined as 



#{true edges that are selected} 

power = — r and 

#{true edges) 

^ #{true edges that are selected} 

^{selected edges} 

respectively. The panels illustrate that for sparse networks, the Lasso's comparatively high 
power comes at the prize of rather low true discovery rate. Again, the power decreases with the 
increase in density of the network. In many practical applications, we argue that it might be more 
valuable to report more stable results with fewer false positives. 

However, it is to be noted that the non-sparse methods using fdr-based procedures for edge 
selection involve an arbitrary parameter: the fdr threshold (here 0.2). These methods can thus 
be made more or less sparse by changing the threshold value. To investigate the relative accuracy 
of the non-sparse methods independently of the particular fdr threshold, the same simulations 
are subsequently performed with other thresholds. In order to evaluate the ability of the three 
methods to detect non-zero partial correlations, their sensitivity and specificity are computed for 
these different fdr thresholds and displayed graphically in form of ROC curves. These can be found 
in the supplementary material. PLS and ridge regression yield very similar results. They slightly 
outperform the approach based on shrinkage covariance estimation. The sensitivity and specificity 
of the Lasso and the adaptive Lasso, which do not depend on a particular threshold, are depicted 
as single points. They are above the ROC curves of the three non-sparse methods, indicating good 
performance - especially for the adaptive Lasso. 

Finally, we compare the runtime of the respective methods in Figure 2. Note that we do not 
display the runtime of the Lasso, as it was computed as an intermediate step in the R- function for 
the adaptive Lasso. The upper part of Figure 2 clearly shows that the computational load for the 
adaptive Lasso is very high. This is due to the fact that we have to run the lasso algorithm k 2 
times in /c-fold cross-validation, and that the (adaptive) lasso algorithm scales unfavorably in the 
number of variables - in contrast to PLS, Ridge Regression or shrinkage. The lower part of Figure 
2 only displays the runtime of the three latter methods. Shrinkage is faster than the regression 
based approaches as it circumvents both time-consuming cross-validation and the computation of 
p different regression models. The discrepancy with respect to the runtime becomes even more 
apparent in the real data study (see below). 



Different network topologies 

Next, we consider different network topologies. We simulate two different types of topologies, 
which are displayed in the supplementary file clusters.pdf. The left part of the figure shows 
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three clusters of genes. In each cluster, all genes are partially correlated, and genes from different 
clusters are not partially correlated. In the simulation, we consider networks with 1,2 and 3 clusters. 
The right part of the figure in clusters.pdf shows three star-shaped clusters. In each star, all 
genes are partially correlated to one gene, the center of the star. In the simulation, we consider a 
network with 3 stars. The MSE, the number of selected edges, the power and the true discovery 
rate are displayed in Figures 3a-d. Again, we observe a high MSE for PLS in most scenarios. 
As explained above, this is probably due to the insufficient shrinkage of PLS towards 0. Overall, 
the Lasso and Ridge Regression perform best in these scenarios. So, in contrast to what is often 
conjectured/reported in the literature, we do find in our simulations that sparse methods are able 
to reconstruct networks in the presence of cluster structures. 

3.2 Real Data Study 

We compare the five different methods on diverse real world data sets: the ecolil [16] and ecoli2 
[38], Ara [40], t.celllO and t.cell34 [26], and west [47] data sets. All data sets are freely 
available. An overview of the size, characteristics and availability of the data sets is given in Table 
2. The five considered methods (shrinkage covariance estimation, ridge regression, PLS, Lasso, 
adaptive Lasso) including the model selection procedures for the regression-based approaches arc 
exactly as in the simulation setting. For ecoli2, we use leave-one-out-cross-validation for model 
selection, and for west, we use k = 5-fold cross-validation. For the remaining 4 data sets, we use 
k = 10. 

In real world scenarios, the ground truth, i.e. the true underlying network, is hardly ever known, 
and the performance of different methods cannot be determined in terms of MSE, power and tdr 
as in the simulation study. Nevertheless, it is possible to compare the performance of the different 
methods quantitatively. In particular, we investigate the size and the connectivity of the estimated 
graphs, their overlap the type of interaction between genes and their stability. 

Figures 4a and 4b display the percentage of selected edges for each data set. As in the simulation 
study, the proportion of selected edges strongly depends on the chosen estimation method. More 
surprisingly, the relative levels of sparsity of the obtained graphs show very different patterns for 
the six investigated data sets. The Lasso and adaptive Lasso seem to behave very differently from 
the other methods. This can at least partly be explained by the fact that they rely on a completely 
different edge selection scheme which essentially depends on the sparsity of the regression method 
and not on the testing scheme. 

In a nutshell, the Lasso and adaptive Lasso select less edges than the other methods for all data 
sets except for the two data sets t . celllO and t . cell34 with repeated measurements. With these 
two data sets, Lasso and adaptive Lasso yield complex graphs with as much as over 50 % non-zero 
edges (t.cell34 data). This behavior is likely to be due to the longitudinal structure of the data 
that is not explicitly considered, since the standard Lasso regression method assumes independent 
observations. In contrast, longitudinal structures may be handled in some implicit way by methods 
using an fdr-based assessment, where the distribution under the null hypothesis is estimated from 
the data. 

PLS reconstructs a very dense network for the data sets ecolil, ara and west. In combination 
with the high MSE that we observed in the simulations, we conjecture that PLS in combination with 
cross-validation is not the most reliable method for the reconstruction of networks. We believe that 
other model selection strategies or the incorporation of sparse PLS [7] may improve the performance 
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of PLS. 



Among the three methods with fdr-based assessment of the edges, i.e PLS, ridge regression and 
shrinkage covariance estimation, the latter procedure seems to be most conservative, whereas PLS 
identifies the highest number of edges. This result is consistent for all six real data sets and yields 
a refinement of the results presented in the simulation study, where these three methods performed 
similarly. 

Table 3 displays the overlap of the estimated graphs. The estimated graphs show a moderate 
overlap between the methods. While considering these results, one should keep in mind that the 
proportions of selected edges vary a lot across the five methods, which of course decreases the 
overlap considerably: a very sparse graph can obviously include only a very small proportion of the 
edges of a more complex graph. Interestingly, the overlap seems to be higher on average for the west 
data set including the highest number of genes than for the other five data sets. We remark that 
the Lasso and adaptive Lasso solutions are computed based on different, random cross-validation 
splits. This explains that, in general, the graph found by adaptive Lasso is not exactly a subgraph 
of the solution found by Lasso. 

Figures 5a and 5b display the connectivity of the estimated graphs for each of the six data sets. 
For each gene, we derive the proportion of genes that are connected to it through an edge, with 
each of the six data sets and each of the five methods. Each boxplot depicts the distribution of the 
proportion of connected genes for the considered method and the considered data set. As explained 
above, the assumption of i.i.d. observations is violated for the data sets t.celllO and t.cell34 . 
This leads to a high number of selected edges for the Lasso and adaptive Lasso (see figures 4a and 
4b), and consequently to a high number of connected genes for these methods. 

Figure 6 displays the percentage of positive (> 0) correlations among the edges identified by 
the five methods for the six data sets. This proportion varies between 0.5 and 0.8. The results 
obtained using the five investigated methods seem much more consistent than the results on the 
number of identified edges. 

We also compare the methods with respect to their stability. This is an important issue in order 
to assess the reliability of the methods [4,32] : a good method is expected to yield a stable network in 
the sense that a slightly modified data set (for instance a subsample) does not lead to a completely 
different result. For data sets ecolil, ecoli2, t.celllO and t.cell34, we draw subsamples by 
excluding w 10% of the observations and compute the network based on each subsample using the 
five methods successively. The number of considered subsamples is fixed to R = 10 (only R = 9 for 
the data set ecoli2 that includes 9 observations). We do not analyze the data sets ara and west, 
because repeated experiments would be computationally too expensive. 

For each candidate edge i, counts how often this edge is selected across the R subsamples. 
Similarly, nf^ = R—n'p denotes the number of times the ith edge is not selected. These frequencies 
are summarized using Fleiss' K-score [11] which measures the degree of agreement among the R 
subsamples of the data. The measure is defined as follows. We first compute the average proportion 
of assignments 




#edges 



P 



,(0 
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Further, the degree of agreement of the R subsamples for the ith edge is measured as 

£(»:") 2 -« 

.1=0 

Finally, with P denoting the average of the Pj's and with P c = (p(°)) 2 + (p^) 2 denoting the 
agreement expected by chance, Fleiss k is defined as 

P-Pc 

K = T^W C - 

The score is always < 1, and the higher the value of k, the more stable the methods are with 
respect to subsampling. 

The K-score of the methods is given in Table 4. As the absolute values are hard to compare 
between data sets, we also display the ranking on each data set. The shrinkage approach is the 
most stable, probably because it does not rely on additional subsampling in form of cross-validation 
splits. The regression based approaches are less stable, but among them, the degree of stability 
is comparable. In particular, in this experiment, we cannot see any difference between sparse and 
non-sparse approaches. 

Finally, the considered methods differ quite dramatically with respect to their run-time. As 
an illustration, we compared the run-time on the west data set, which contains 3883 genes. The 
approach based on shrinkage covariance estimation is by far the most efficient one (~ 2 min), and 
all other methods scale within several hours: PLS ~ 7.5 hours, ridge regression ~ 10 hours, the 
Lasso « 17 hours, and the adaptive Lasso « 3.5 days. This can be seen as a major drawback of 
the methods relying on cross-validation schemes, especially the Lasso-based methods. While Ridge 
Regression and PLS allow a representation that only scales in the number of observations, Lasso 
and adaptive Lasso scale in the number of variables. Furthermore, adaptive Lasso requires nested 
cross-validation. Partial relief may be found in a parallel implementation. Alternatively, for high- 
dimensional data, one might consider to approximate the Lasso-based networks by first constructing 
a mildly sparse network without cross-validation (for example using the method described in [22]), 
and then to refine this network by running the (adaptive) Lasso with cross-validation. 

4 Discussion 

In this paper, we proposed and compared different methods to estimate partial correlation co- 
efficients based on regularized regression techniques with applications to genetic networks. In a 
simulation study, we assessed the performance of the considered methods in terms of estimation 
accuracy (MSE) and in terms of reverse engineering of the true underlying networked topology. As 
a result, the investigated non-sparse methods (PLS, ridge regression, and the approach based on 
shrinkage covariance estimation that served as a reference method) were found to perform similarly. 
It is to be noted that these methods have fdr-based significance testing in common. They are rather 
conservative with respect to the inclusion of edges when used with the standard fdr threshold 0.2. 
The Lasso tends to produce too "dense" structures, while the adaptive Lasso compensates for that 
by selecting edges in a two-step approach, therefore leading to sparser graphs. The latter two-stage 
approach is able to select relevant edges, even for small samples, while at the same time preventing 



R(R + 1) 
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to be too dense. For denser networks, the performances of the five methods are very similar. On 
real world data, the behavior of the non-sparse methods is again similar, except that PLS is less 
conservative than ridge regression and the approach using a shrinkage covariance estimator. A 
remarkable difference with respect to the different data sets is the behavior of the Lasso and the 
adaptive Lasso on the t.cell data sets. In contrast to the four other data sets, the t.cell data include 
replications, thus violating the assumption of independent samples. Consequently, the (adaptive) 
Lasso does not handle the underlying data structure correctly, while empirical null modeling seems 
to account for the decreased "effective" sample size in an implicit way. 

Note that all investigated methods require the specification of tuning parameters that need to be 
optimized based on the available data. The choice of the model selection criterion itself strongly 
influences the results of the methods [2] , especially for small n. As an example, the model selection 
procedure introduces a substantial amount of variation for the Lasso and the adaptive Lasso. In the 
real world study, we estimate the two graphs on two different random cross-validation splits, which 
leads to an overlap of only 88.4% on the west data, although the adaptive Lasso graph is defined 
as a subgraph of the Lasso graph. Hence, tuning parameters should be given much attention in 
future research when new methods are developed. Moreover, setting the parameters to fixed values 
without proper selection procedure (such as cross-validation) and just because they "yield nice 
results" is an incorrect and biased strategy which may favor the proposed novel method. Further- 
more, from a computational point of view, a major strength of the shrinkage approach is that the 
optimal amount of regularization can be estimated from the data using an analytic formula, thus 
making time-consuming cross-validation procedures unnecessary. 

We want to emphasize that there are interesting alternatives for the detection of significant 
edges that do not depend on sparsity penalties or testing based on local false discovery rates. For 
instance, Reverter & Chan [27] propose information theoretic measures for the reconstruction of 
gene co-expression networks. The comparative performance of these methods and their connections 
to the approaches investigated above may be explored in future research. 

Finally, the methods discussed in this paper can potentially be used for detecting causal in- 
teractions [1,24]. For instance, in the presence of longitudinal data, Arnold, et.al. [1] propose 
to identify the direction of interactions between variables by investigating partial correlations be- 
tween time-shifted copies of the variables. Amongst others, they propose to estimate these partial 
correlations using Lasso regression, but other regression methods might be promising alternatives. 

5 Conclusion 

We briefly summarize our findings. 

Performance: In the simulation, the investigated non-sparse regression methods, i.e. Ridge 
Regression and Partial Least Squares, exhibit rather conservative behavior when combined with 
(local) false discovery rate multiple testing in order to decide whether or not an edge is present 
in the network. For networks with higher densities, the difference in performance of the methods 
decreases. Both sparse and non-sparse methods can deal with cluster topologies in the network. 

For PLS, we observe both a high MSE in the simulations and a high percentage of selected 
edges in some of the real data. In our opinion, this is an indication that PLS itself might not be too 
well-suited for the reconstruction of networks. The reasons are that PLS is not sparse by design, 
and that it does not shrink arbitrarily close to zero. Therefore, we suggest to incorporate sparse 
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versions of PLS instead in future research. 

On six real data sets, we also clearly distinguish the results obtained using the non-sparse 
methods and those obtained using the sparse methods where specification of the regularization pa- 
rameter automatically means model selection. For data that violate the assumption of uncorrelated 
observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, 
indicating that they might not be suited under these conditions. 

Stability: We compared the stability of the methods under two aspects. All regression-based 
methods are less stable than the shrinkage approach over different subsamples of the data, and 
within the regression-based approaches, there is no clear difference between sparse and non-sparse 
methods. However, the two sparse regression methods seem to be unstable with respect to violations 
of the i.i.d assumption of the samples. 

Runtime: The computational load for the Lasso and in particular for the adaptive Lasso is 
considerable. For very high-dimensional data, this can constitute a severe limitation. The runtime 
might be decreased by applying parallel computation techniques or by preselecting a coarse network 
topology that does not rely on cross-validation. While PLS and Ridge Regression are slower than 
shrinkage, both of them are fairly fast to compute, as they allow a kernel representation, i.e. most 
of the computation scales in the number of samples and not in the number of variables. 



Available Software 

The regularized estimation of partial correlations and the construction of gene association networks 
with (adaptive) Lasso, ridge regression and PLS are implemented in the R package parcor [19] which 



is available from the CRAN repository http://cran.r-project.org/ The package relies heavily on 



the lars package [14]. For assigning statistical significane to the edges in the network, we use the 
fdrtool package [43]. An executable sheet for the simulations can be downloaded^] 



Authors' contribution 

NK and ALB initiated the study. NK wrote the initial version of the manuscript. JS and NK 
implemented the R package, NK and ALB performed the analyses. All authors contributed to the 
concept and to the manuscript. 



Acknowledgments 

This work was supported in part by the BMBF grant FKZ 01-IS07007A (ReMind), and the FP7- 
ICT Programme of the European Community, under the PASCAL2 Network of Excellence, ICT- 
216886. Financial support from DSM Nutritional Products Ltd. (Basel, Switzerland) is gratefully 
acknowledged. We thank Lukas Meier and Mikio L. Braun for constructive comments on model 
selection, and Animesh Acharjee for helpful feedback on the R package "parcor" . 

1 http:/ /ml. cs.tu-berlin.de/~nkraemer/software. html 



15 



References 



A. Arnold, Y. Liu, and N. Abe, Temporal Causal Modeling with Graphical Granger Meth- 
ods, Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge 
Discovery and Data Mining,, 2007. 

A. L. Boulesteix, A. Kondylis, and N. Kramer, Comment on: Augmenting the bootstrap to 
analyze high dimensional genomic data, TEST 17 (2008), 31-35. 

A.-L. Boulesteix, S. Lambert-Lacroix, J. Peyre, and K. Strimmer, plsgenomics: Pis analyses 
for genomics, 2007, R package version 1.2-2. 

A.-L. Boulesteix and M. Slawski, Stability and aggregation of ranked gene lists, Briefings in 
Bioinformatics 10 (2009), no. 5, 556-68. 

A.-L. Boulesteix and K. Strimmer, Partial Least Squares: a Versatile Tool for the Analysis of 
High- Dimensional Genomic Data, Briefings in Bioinformatics 8 (2007), no. 1, 32-44. 

L. Breiman, Bagging predictors, Machine Learning 24 (1996), 123-140. 

H. Chun and S. Keles, Sparse partial least squares for simultaneous dimension reduction and 
variable selection., Journal of the Royal Statistical Society (2009), to appear. 

A. d'Aspremont, O. Banerjee, and L. El Ghaoui, First-Order Methods for Sparse Covariance 
Selection, SIAM Journal on Matrix Analysis and its Applications 30 (2008), no. 1, 56-66. 

A. Dobra, C. Hans, B. Jones, J.R. Nevins, G. Yao, and M. West, Sparse Graphical Models for 
Exploring Gene Expression Data, Journal of Multivariate Analysis 90 (2004), no. 1, 196-212. 

B. Efron, Large-Scale Simultaneous Hypothesis Testing: the Choice of a Null Hypothesis, Jour- 
nal of the American Statistical Association 99 (2004), 96-104. 

J.L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76 
(1971), no. 5, 378-382. 

J. Friedman, T. Hastie, and R. Tibshirani, Sparse Inverse Covariance Estimation with the 
Graphical Lasso, Biostatistics 9 (2008), no. 3, 432-441. 

N. Friedman, Inferring Cellular Networks using Probabilistic Graphical Models, Science 303 
(2004), no. 5659, 799-805. 

T. Hastie and B. Efron, lars: Least angle regression, lasso and forward stagewise, 2007, R 
package version 0.9-7. 

A.E. Hoerl and R.W. Kennard, Ridge Regression: Biased Estimation for Nonorthogonal Prob- 
lems, Technometrics 42 (2000), no. 1, 80-86. 

K.C. Kao, Y.L. Yang, R. Boscolo, C. Sabatti, V. Roychowdhury, and J.C. Liao, Transcriptome- 
based Determination of Multiple Transcription Regulator Activities in Escherichia Coli by 
Using Network Component Analysis, Proceedings of the National Academy of Sciences 101 
(2004), no. 2, 641-646. 



16 



[17] N. Kramer, On the Peaking Phenomenon in Model Selection for the Lasso, preprint, available 
at http://arxiv.org, 2009. 

[18] N. Kramer and M. L. Braun, Kernelizing PLS, Degrees of Freedom, and Efficient Model Selec- 
tion, Proceedings of the 24th International Conference on Machine Learning (Z. Ghahramani, 
ed.), 2007, pp. 441-448. 

[19] N. Kramer and J. Schafer, parcor: estimation of partial correlations based on regularized re- 
gression, 2009, R package version 0.1. 

[20] H. Li and J. Gui, Gradient Directed Regular ization for Sparse Gaussian Concentration Graphs, 
with Applications to Inference of Genetic Networks, Biostatistics 7 (2008), no. 2, 302-317. 

[21] S. Ma, Q. Gong, and H. J. Bohnert, An Arabidopsis Gene Network Based on the Graphical 
Gaussian Model, Genome Research 17 (2007), 1614-1625. 

[22] N. Meinshausen and P. Biihlmann, High Dimensional Graphs and Variable Selection with the 
Lasso, Annals of Statistics 34 (2006), no. 3, 1436-1462. 

[23] R. Opgen-Rhein and K. Strimmer., longitudinal: Analysis of multiple time course data, 2008, 
R package version 1.1.4. 

[24] J.-P. Pellet and A. Elisseeff, A partial correlation-based algorithm for causal structure discov- 
ery with continuous variables, Advances in Intelligent Data Analysis VII, 7th International 
Symposium on Intelligent Data Analysis (M. R. Berthold, J. Shawe- Taylor, and N. Lavrac, 
eds.), 2007, pp. 229-239. 

[25] V. Pifmr, S. Datta, and S. Datta, Reconstruction of Genetic Association Networks from Mi- 
croarray Data, Bioinformatics 24 (2008), no. 4, 561-568. 

[26] C. Rangel, J. Angus, Z. Ghahramani, M. Lioumi, E. Sotheran, A. Gaiba, D.L. Wild, and 
F. Falciani, Modeling T-cell Activation using Gene Expression Profiling and State-Space Mod- 
els, Bioinformatics 20 (2004), 1361-1372. 

[27] A. Reverter and E.K.F Chan, Combining Partial Correlation and an Information Theory 
Approach to the Reversed- engineering of Gene Co-expression Networks, Bioinformatics (2008), 
doi : 1 . 1 093 /bioinformatics /btn482 . 

[28] R. Rosipal and N. Kramer, Overview and Recent Advances in Partial Least Squares, Sub- 
space, Latent Structure and Feature Selection Techniques, Lecture Notes in Computer Science, 
Springer, 2006, pp. 34-51. 

[29] R. Rosipal and L.J. Trejo, Kernel Partial Least Squares Regression in Reproducing Kernel 
Hilbert Spaces, Journal of Machine Learning Research 2 (2001), 97-123. 

[30] A. Rothman, P. Bickel, E. Levina, and J. Zhu, Sparse Permutation Invariant Covariance 
Estimation, Electronic Journal of Statistics 2 (2008), 494-515. 

[31] M. Ruschhaupt, Erzeugung von positiv definiten matrizen mit nebenbedingungen zur vali- 
dierung von netzwerkalgorithmen fur microarray-daten, Ph.D. thesis, University of Munich, 
2008. 



17 



[32] Y. Saeys, I. Inza, and P. Larranaga, A review of feature selection techniques in bioinformatics, 
Bioinformatics 23 (2007), no. 19, 2507. 

[33] H. Saigo, N. Kramer, and K. Tsuda, Partial Least Squares Regression for Graph Mining, 
14th International Conference on Knowledge Discovery and Data Mining (KDD2008), 2008, 
pp. 578-586. 

[34] J. Schafer, Comments on: Augmenting the Bootstrap to Analyze High Dimensional Genomic 
Data, TEST 17 (2008), no. 1, 28-30. 

[35] J. Schafer, R. Opgen-Rhein, and K. Strimmer, Reverse Engineering Genetic Networks using 
the GeneNet Package, R News 5/6 (2006), 50-53. 

[36] J. Schafer and K. Strimmer, A Shrinkage Approach to Large-Scale Covariance Matrix Esti- 
mation and Implications for Functional Genomics, Statistical Applications in Genetics and 
Molecular Biology 4 (2005), 32. 

[37] J. Schafer and K. Strimmer, An Empirical Bayes Approach to Inferring Large-Scale Gene 
Association Networks, Bioinformatics 21 (2005), 754-764. 

[38] W. Schmidt-Heck, R. Guthke, S. Toepfer, H. Reischer, K. Duerrschmid, and K. Bayer, Reverse 
engineering of the stress response during expression of a recombinant protein, EUNITE 2004 
European Symposium on Intelligent Technologies, Hybrid Systems and their Implementation 
on Smart Adaptive Systems, 2004, pp. 407-441. 

[39] J. E. Schmitt, R. K. Lenroot, G. L. Wallace, S. Ordaz, K. N. Taylor, N. Kabani, D. Greenstein, 
J. P. Lerch, K. S. Kendler, M. C. Neale, and J. N. Giedd, Identification of Genetically Mediated 
Cortical Networks: A Multivariate Study of Pediatric Twins and Siblings, Cerebral Cortex 18 
(2008), no. 8, 1737-1747. 

[40] S.M. Smith, D.C. Fulton, T. Chia, D. Thorneycroft, A. Chappie, H. Dunstan, C. Hylton, S.C. 
Zeeman, and A.M. Smith, Diurnal Changes in the Transcriptome Encoding Enzymes of Starch 
Metabolism Provide Evidence for Both Transcriptional and Posttranscriptional Regulation of 
Starch Metabolism in Arabidopsis Leaves 1, Plant Physiology 136 (2004), no. 1, 2687-2699. 

[41] K. Strimmer, A Unified Approach to False Discovery Rate Estimation, BMC Bioinformatics 9 
(2008), 303. 

[42] , Comments on: Augmenting the Bootstrap to Analyze High Dimensional Genomic 

Data, TEST 17 (2008), no. 1, 25-27. 

[43] , fdrtool: a versatile R package for estimating local and tail area-based false discovery 

rates, Bioinformatics 24 (2008), 1461-1462. 

[44] A. Tenenhaus, V. Guillemot, X. Gidrol, and V. Frouin, Gene Association Networks from Mi- 
croarray Data using a Regularized Estimation of Partial Correlation based on PLS Regression, 
IEEE Transactions on Computational Biology and Bioinformatics to appear (2008). 

[45] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statis- 
tical Society, Series B 58 (1996), no. 1, 267-288. 



18 



[46] S. Tyekucheva and F. Chiaromonte, Augmenting the Bootstrap to Analyze High Dimensional 
Genomic Data, TEST 17 (2008), no. 1, 1-18. 



[47] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. A. Olson Jr, 
J.R. Marks, and J.R. Nevins, Predicting the Clinical Status of Human Breast Cancer by using 
Gene Expression Profiles, Proceedings of the National Academy of Sciences 98 (2001), no. 2, 
11462-11467. 

[48] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley New York, 1990. 

[49] D.M. Witten and R. Tibshirani, Covariance-regularized regression and and classification for 
high- dimensional problems, Journal of Royal Statistical Society, Series B, to appear. 

[50] H. Wold, Path Models with Latent Variables: The NIPALS Approach, Quantitative Sociology: 
International Perspectives on Mathematical and Statistical Model Building (H. M. Blalock 
et al., ed.), Academic Press, 1975, pp. 307-357. 

[51] S. Wold, H. Ruhe, H. Wold, and W. J. Dunn, III, The Collinearity Problem in Linear Re- 
gression. The Partial Least Squares (PLS) Approach to Generalized Lnverses, SIAM Journal 
of Scientific and Statistical Computations 5 (1984), 735-743. 

[52] M. K. S. Yeung, J. Tegner, and J .J. Collins, Reverse Engineering Gene Networks using 
Singular Value Decomposition and Robust Regression, Proceedings of the National Academy 
of Sciences 99 (2002), no. 9, 6163-6168. 

[53] M. Yuan, Efficient Computation of l\ Regularized Estimates in Gaussian Graphical Models, 
Journal of Computational and Graphical Statistics 17 (2008), no. 4, 809-826. 

[54] M. Yuan and Y. Lin, Model Selection and Estimation in the Gaussian Graphical Model, 
Biometrika 94 (2007), no. 1, 19-35. 

[55] S. Zhou, S. van de Geer, and P. Buhlmann, Adaptive lasso for high dimensional regression and 
gaussian graphical modeling, Preprint, arXiv:0903. 2515^1 (2009). 

[56] H. Zou, The Adaptive Lasso and its Oracle Properties, Journal of the American Statistical 
Association 101 (2006), no. 476, 1418-1429. 



19 



A Figures 



Figure la - MSE, number of edges, power and TDR for a density of 0.05 

Mean squared error, number of selected edges, power and true discovery rate (TDR) for the various 
methods PLS, ridge regression, the approach based on shrinkage covariance estimation, Lasso, and 
adaptive Lasso. 
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Figure lb - MSE, number of edges, power and TDR for a density of 0.10 



Mean squared error, number of selected edges, power and true discovery rate (TDR) for the various 
methods PLS, ridge regression, the approach based on shrinkage covariance estimation, Lasso, and 
adaptive Lasso. 
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Figure lc - MSE, number of edges, power and TDR for a density of 0.15 



Mean squared error, number of selected edges, power and true discovery rate (TDR) for the various 
methods PLS, ridge regression, the approach based on shrinkage covariance estimation, Lasso, and 
adaptive Lasso. 
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Figure Id - MSE, number of edges, power and TDR for a density of 0.20 



Mean squared error, number of selected edges, power and true discovery rate (TDR) for the various 
methods PLS, ridge regression, the approach based on shrinkage covariance estimation, Lasso, and 
adaptive Lasso. 



0.2 




50 100 150 
sample size 



200 



co 
CD 
O) 
T3 
CD 

ctf 
o 



co 



o 
o 



o 
o 

CM 



O - 



50 100 150 
sample size 



200 



— •— shrinkage 




-o lasso 




a adalasso 




v - pis 




-e— ridge 







50 100 150 
sample size 



200 



co oo 



CD 

> 
o 
o 
co 

CD 



Xt 
O 



O 
O 







A- A 





















50 100 150 
sample size 



200 



23 



Figure le - MSE, number of edges, power and TDR for a density of 0.25 



Mean squared error, number of selected edges, power and true discovery rate (TDR) for the various 
methods PLS, ridge regression, the approach based on shrinkage covariance estimation, Lasso, and 
adaptive Lasso. 
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Figure 2 - Runtime of the respective methods 
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Figure 3a - Network topology: 1 cluster 
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Figure 3b - Network topology: 2 clusters 
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Figure 3c - Network topology: 3 clusters 
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Figure 3d - Network topology: 3 stars 
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Figure 4a - Proportion of selected edges 
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Figure 4b - Proportion of selected edges without PLS 
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Figure 5a - Connectivity: Proportion of connected genes 
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Figure 5b - Connectivity: Proportion of connected genes without PLS 
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Figure 6 - Percentage of positive correlations 
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B Tables 



Table 1 - Overview of the methods 

Characteristics of the five methods considered in this study: (1) shrinkage estimator of the covari- 
ance (shrink), (2) PLS regression (pis), (3) Ridge Regression (ridge), (4) Lasso regression (lasso), 
(5) Adaptive Lasso (adalasso). 2nd column: Type of the method (inversion of a regularized esti- 
mate of the covariance matrix or regression-type method). 3rd column: Parameter determining 
the amount of regularization. 4th column: Method used to choose this(these) parameter (s). 5th 
column: Method used to decide whether two genes should be edge-connected. 



Method 


type 


parameter(s) 


choice 


edge if 


shrink 


regularized estimation 
of the covariance 


shrinkage intensity A 


analytic 


fdr< 0.2 


pis 


regression 


number of components m 


CV 


fdr< 0.2 


ridge 


regression 


penalty A 


CV 


fdr< 0.2 


lasso 


regression 


penalty A 


CV 


Pij + o 


adalasso 


regression 


penalty A (x2) 


nested CV (x2) 


Pij + o 



Table 2 - Size of the data sets 



data set 


arrays 


genes 


time series 


repeated 
measurements 


size of 
full graph 


Availability 


ecolil 


23 


100 


yes 


no 


4 950 


R package plsgenomics [3] 


ecoli2 


9 


102 


yes 


no 


5151 


R package GeneNet [35] 


ara 


22 


800 


yes 


no 


319 600 


R package GeneNet 


t . celllO 


100 


58 


yes 


yes 


1653 


R package longitudinal [23] 


t . cell34 
west 


340 
49 


58 
3883 


yes 
no 


yes 
no 


1653 
7536 903 


R package longitudinal 

Ihttp: / /strimmer lab.org/data.htmll 
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Table 3 - Overlap of the estimated graphs 

Example: On the ecolil data set, 68,6% of the edges found by Ridge Regression are also found 
by PLS. For baseline comparison, the numbers in italics show the percentage of selected edges for 
the respective methods. 
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n 4^7 
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ecolil 
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1 nnn 
1 .uuu 
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U.Ool 


u. 100 
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1.000 
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Dill 1111S. 
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1.000 


1.000 


1.000 


1.000 




% selected 


0.109 


tf.027 


0.575 


0.411 


0.011 




pis 


1.000 


0.053 


0.762 


0.670 


0.031 




ridge 


1.000 


1.000 


1.000 


1.000 


0.583 


t.cell34 


lasso 


0.345 


0.024 


1.000 


0.643 


0.014 




adalasso 


0.433 


0.034 


0.917 


1.000 


0.020 




shrink 


1.000 


1.000 


1.000 


1.000 


1.000 




% selected 


0.134 


0.005 


0.284 


0.221 


0.004 




pis 


1.000 


0.089 


0.017 


0.008 


0.041 




ridge 


0.667 


1.000 


0.118 


0.062 


0.236 


west 


lasso 


0.643 


0.611 


1.000 


0.407 


0.404 




adalasso 


0.673 


0.694 


0.884 


1.000 


0.458 




shrink 


0.632 


0.488 


0.161 


0.084 


1.000 




% selected 


0.086 


0.011 


0.002 


0.001 


0.006 
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Table 4 - Stability of the Methods 



For the data sets ecolil, ecoli2, t.celllO, t.cell34 and west, we display Fleiss' kappa. The 
quantity is always < 1, and the higher the value, the more stable is the method. 



data set 


measure 


pis 


ridge 


lasso 


adalasso 


shrink 


ecolil 




K 


0.630 


0.510 


0.550 


0.550 


0.593 




ranking 


: of k 


1 


5 


3.5 


3.5 


2 


ecoli2 




K 


0.242 


0.280 


0.469 


0.450 


0.486 




ranking 


; of k 


5 


4 
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3 
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t.celllO 
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0.656 


0.797 


0.670 


0.674 


0.742 




ranking 
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5 
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t.cell34 
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0.655 
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0.625 


0.619 


0.702 




ranking 


; of k 


2 
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3 


4 


1 




mean ranking 


: of k 


3.25 


3.75 


3.125 


3.375 


1.5 
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C Supplementary Material 



Figure 2a - ROC curves for a density of 0.05 

ROC curves obtained by varying the fdr-threshold for PLS, Ridge Regression and Shrinkage. The 
sensitivity and specificity of Lasso and Adaptive Lasso are represented by a point. 
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Figure 2b - ROC curves for a density of 0.10 



ROC curves obtained by varying the fdr-threshold for PLS, Ridge Regression and Shrinkage. The 
sensitivity and specificity of Lasso and Adaptive Lasso are represented by a point. 
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Figure 2c - ROC curves for a density of 0.15 



ROC curves obtained by varying the fdr-threshold for PLS, Ridge Regression and Shrinkage. The 
sensitivity and specificity of Lasso and Adaptive Lasso are represented by a point. 
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Supplement 2d - ROC curves for a density of 0.20 



ROC curves obtained by varying the fdr-threshold for PLS, Ridge Regression and Shrinkage. The 
sensitivity and specificity of Lasso and Adaptive Lasso are represented by a point. 
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Supplement 2e - ROC curves for a density of 0.25 



ROC curves obtained by varying the fdr-threshold for PLS, Ridge Regression and Shrinkage. The 
sensitivity and specificity of Lasso and Adaptive Lasso are represented by a point. 
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Supplement 3 - Histogram of partial correlations for different degrees of density 
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